Exploring Various Datasets of the Bay Area and Finding Useful Insights for Policymakers

R
Spatial Analysis
AIRBNB
Data Cleaning
RESTFul API
Interactive Maps and Geodemographic Clustering with the K-Means classification method.
Author

Jay T. Pentapati

Published

January 10, 2024

Note:

Please click the ‘Go Back’ button to return to the Central Page

1 Installing Packages

The code below checks whether the packages listed in the variable (packages_required) are installed. Any missing packages are installed automatically, and all required packages are then loaded for the analysis.

Code for Installing all required R packages.
packages_required <- c("tidycensus", "tigris","tidyverse", "data.table","sf", "tmap", "tmaptools","readr", "geojsonsf", "osmdata","devtools", "RColorBrewer","classInt","R.utils","dplyr", "ggplot2", "viridis","raster","terra","exactextractr", "tidyterra", "tibble", "rosm", "spdep", "cluster", "GGally", "rgeoda", "tidyr", "gridExtra","patchwork") 
# Add any additional packages needed to the variable above.

new_req_packages <- packages_required[!(packages_required %in% installed.packages()[,"Package"])] # filter out packages that are already installed; the new variable holds the names of the uninstalled ones.
if(length(new_req_packages)) install.packages(new_req_packages) # install any packages not installed yet

for(i in seq_along(packages_required)) {
  library(packages_required[i], character.only = T)        # loading all packages named in packages_required
}
# The message below is separately coded output confirming the installed packages; it is not the output of the loop above

[1] “All required packages for Assignment-2 are installed and loaded properly.”

2 Introduction

2.1 Rationale for CRS

According to the USGS website, maps in the USA commonly use a projection from the State Plane Coordinate System (United States Geological Survey, 2023). It is a plane coordinate system in which each state is divided into up to six zones, depending on the geometry of the state (United States Geological Survey, 2023). These zones use a transverse Mercator or Lambert conformal conic projection, which preserves shape and scale over smaller areas such as ZCTAs or tracts.

By consulting the Map of USA State Plane Zones NAD83 on the ArcGIS website (ArcGIS Hub, 2020), we can identify that the Bay Area lies in California Zone 3. Therefore, we use EPSG:2227 (NAD83 / California zone 3, ftUS).

2.2 Description of Data

Airbnb listings data for San Francisco, Oakland and San Mateo are extracted from the Inside Airbnb website. The variables chosen from the ACS socio-economic metrics are Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband Internet subscription (B28005_011E). We also use the Bay Area ZIP Code Areas shapefile.

For additional datasets, we use the number of light-duty vehicles per ZIP code in 2020 and the average commercial electricity utility rate per ZIP code in 2020, covering both investor-owned and non-investor-owned utilities.

Note: An API key for Census data can be obtained by signing up at http://api.census.gov/data/key_signup.html. We then use the tidycensus package to get ACS data from the US Census Bureau.

Code for setting up API Key
census_api_key(J_API_KEY , install = TRUE, overwrite = TRUE) # Please replace 'J_API_KEY' with API Key obtained from census bureau
# The message below is separately coded output checking the system environment variable (CENSUS_API_KEY); it is not the output of the line above

[1] “API key to get Census data is successfully set up.”

2.3 Data Extraction and Pre-Processing

We make two separate API calls to retrieve Median Household Income (B19013_001E) and the number of people aged 18-64 who have a computer and a broadband Internet subscription (B28005_011E) from the ACS 2016-2020. This approach is more efficient and straightforward than making a single API call with two variables: the latter can result in a spatial or non-spatial file with a long-format attribute table, which in turn needs to be converted into a wide-format table for analysis.

We extract values of the two variables for the entire US and only later filter down to ZIP codes specific to the Bay Area. This is because some California ZCTAs extend into other states, which causes errors when extracting ZCTA-level data for specific states (‘zip - Determining which US zipcodes map to more than one state or more than one city? - Geographic Information Systems’, 2021). According to the US Census Bureau, all ZCTA codes are valid US ZIP codes (United States Census Bureau, 2023).

  • Spatial Resolution: ZCTA level
  • Time: 2020
Cleaning and pre-processing Data
median_hh_income <- get_acs(  # get_acs is the tidycensus function we use to get ACS survey data from the US Census Bureau
  geography = "zcta",         # geography unit for the variable, e.g. ZCTA, Tract, County, State
  variables = "B19013_001E",  # MEDIAN HOUSEHOLD INCOME; "B19013_001E" is the variable name in the ACS
  year = 2020,      # census data collected 2016-2020
  geometry = FALSE  # we use the geometry of the Bay Area ZIP code shapefile from the Berkeley Library, not the ACS data
) %>%
  rename(estimate_MedianHhI = estimate) %>%   # renaming the retrieved variable (estimate) to a descriptive name (estimate_MedianHhI)
  select(GEOID, estimate_MedianHhI)           # keeping only GEOID (the ZCTA code, also the ZIP code) and the estimate, ignoring the MOE

# Retrieving Internet Use Variable
internet_use <- get_acs(      # get_acs is the tidycensus function we use to get ACS survey data from the US Census Bureau
  geography = "zcta",         # geography unit for the variable, e.g. ZCTA, Tract, County, State
  variables = "B28005_011E",  # number of people aged 18-64 who have a computer and a broadband Internet subscription; "B28005_011E" is the variable name in the ACS
  year = 2020,                # census data collected 2016-2020
  geometry = FALSE            # we use the geometry of the Bay Area ZIP code shapefile from the Berkeley Library, not the ACS data
) %>%
  rename(estimate_InternetUse = estimate) %>%  # renaming the retrieved variable (estimate) to a descriptive name (estimate_InternetUse)
  select(GEOID, estimate_InternetUse)          # keeping only GEOID (the ZCTA code, also the ZIP code) and the estimate, ignoring the MOE

us_zip <- left_join(median_hh_income, internet_use, by = "GEOID") # join the two datasets by GEOID to get the data in the required format
Code for Showing first six rows of Data being used
head(us_zip)  #showing first six rows of Dataset
# A tibble: 6 × 3
  GEOID estimate_MedianHhI estimate_InternetUse
  <chr>              <dbl>                <dbl>
1 00601              14398                 6839
2 00602              16771                10955
3 00603              15786                21193
4 00606              14980                 1443
5 00610              20167                 7742
6 00612              19402                27116

We downloaded the Bay Area listings data (San Francisco, Oakland and San Mateo) from the Inside Airbnb website and the Bay Area ZIP Code Areas from the Berkeley Library. All .csv files are loaded with the read.csv function from the utils package (the loading process can be found in the source code of this file).
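As the loading step lives only in the source code, here is a minimal sketch of the pattern, assuming hypothetical file contents (a throwaway temp file stands in for the downloaded Inside Airbnb exports):

```r
# Sketch of the loading pattern; the temp file stands in for the real
# downloaded listings files, which are read the same way with read.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,price,longitude,latitude",
             "1,120,-122.42,37.77",
             "2,95,-122.27,37.80"), tmp)

listings_demo <- read.csv(tmp, stringsAsFactors = FALSE)

# The three city files would then be combined, e.g. (hypothetical names):
# listings_bayarea <- rbind(listings_sf, listings_oakland, listings_sanmateo)
nrow(listings_demo)  # 2
```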

Two Additional Datasets:

We will use the commercial electricity rates dataset per ZIP code, for both investor-owned and non-investor-owned utilities.

Code for Data cleaning and Pre-processing of Electricity rates data
non_iou_data_c <- non_iou_data %>%  # Data cleaning process for the additional datasets
  select(zip, comm_rate) %>%       # we select only zip and the commercial rate for non-investor-owned utilities
  group_by(zip) %>%                # we group them by ZIP
  summarise(
    Avg_Non_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE)  # Mean of different Non-Investor Based utilities per ZIP
  )
  

iou_data_c <- iou_data %>%
  select(zip, comm_rate) %>%      # we select only zip and the commercial rate for investor-owned utilities
  group_by(zip) %>%               # we group them by ZIP
  summarise(
    Avg_IOU_Comm_Rate = mean(comm_rate, na.rm = TRUE)  # Mean of different Investor Based utilities per ZIP
  )

electric_u <- non_iou_data_c %>%
  full_join(iou_data_c, by = "zip")%>%   # Join both Investor-owned and Non-Investor utility datasets by zipcode
  mutate(
    Avg_IOU_Comm_Rate = replace_na(Avg_IOU_Comm_Rate, 0),    # replacing any NA values with 0
    Avg_Non_IOU_Comm_Rate = replace_na(Avg_Non_IOU_Comm_Rate, 0 ),  # replacing any NA values with 0
    Combined_Avg_Comm_Rate = (Avg_Non_IOU_Comm_Rate + Avg_IOU_Comm_Rate ) / 2  # using the average as the statistic for the combined commercial electricity rate
  )%>%
  select(zip, Combined_Avg_Comm_Rate) %>% rename(ZIP = zip) %>% mutate(ZIP = as.character(ZIP)) # selecting only ZIP and New Statistic, ignoring rest

vc_califorina <- vehicle_count_2020 %>%
  select(-Date, -Model.Year, -Fuel, -Make) %>%     # we keep only the Zip.Code, Vehicles and Duty columns and drop the rest
  filter(Duty == "Light" & Zip.Code != "OOS") %>%  # we keep only light-duty vehicles and remove the OOS (out-of-state) rows from the data
  group_by(Zip.Code) %>%                           # group by ZIP code
  summarise(No_of_Light_Duty_Vehicles = sum(Vehicles)) %>% rename( ZIP = Zip.Code)  # sum light-duty vehicles per ZIP and then rename the column

3 Methodology

3.1 No. of AIRBNB Listings with Mean Price per ZIP Code Areas

The Bay Area Airbnb listings we downloaded have longitude and latitude coordinates, so we initially set the CRS of the listings to EPSG:4326 (WGS84). We then re-project into EPSG:2227 (NAD83 / California zone 3, ftUS).

We then perform a spatial join between the Bay Area ZIP Code Areas and the Airbnb listings, which results in an overlay shapefile. This enables us to understand the geographical distribution of Airbnb listings and their characteristics across the Bay Area ZIP code areas.

Code for pre-processing of Shapefiles
listings_bayarea_sf <- listings_bayarea %>%
  st_as_sf(coords = c("longitude", "latitude")) %>% # creating point shapefile from coordinates
  st_set_crs(4326)                                  # setting CRS to geodetic system of earth (4326, WGS84) as they are long, lat coords

listings_bayarea_sf_transform <- st_transform(listings_bayarea_sf, crs = 2227)   # Now, Changing CRS to mentioned Project CRS
overlay_listings <- st_join(bayarea_zipcodes, listings_bayarea_sf_transform)     # Performing Spatial Join to create overlay Shapefile of Airbnb Listings and Bay area ZIP shapefile

st_crs(overlay_listings) == st_crs(listings_bayarea_sf_transform)                # Verifying CRS of New shapefile

[1] TRUE

We need to plot the number of Airbnbs and their mean price per ZIP code in the Bay Area. To derive these variables, we group by ZIP code, count the non-NA price rows of the listings in each ZIP code, and take the mean price of those listings.

listings_bayarea_zip <- overlay_listings  %>%  # creating new dataframe with no.of Airbnb's per ZIP and Mean Price per ZIP
  group_by(ZIP) %>%                     # group by ZIP
  summarise(
    count_airbnb = sum(!is.na(price)),      # counting non-NA price rows for listings
    mean_price = mean(price, na.rm = TRUE)  # calculate mean, ignoring NAs
  )%>%
  filter(count_airbnb != 0) 
listings_bayarea_zip1 <- listings_bayarea_zip %>%
  select(count_airbnb, everything())  # Changing the order of attributes for interactive map

listings_bayarea_zip2 <- listings_bayarea_zip %>%
  select(mean_price, everything())%>% # Changing the order of attributes for interactive map
  mutate(mean_price = round(mean_price, 1))  # rounding mean price to one decimal place.

Based on the summary statistics for both the count and mean price of Airbnbs per ZIP code: the minimum of the count variable is 1 and its first quartile is 71, so the first class is divided accordingly. The median of the count variable is 172 and its standard deviation is 162.5, which define the second and third classes. Finally, to identify outliers, since the third quartile is 257, the standard deviation is 162.5 and the maximum is 794, the last class has 500 as its lower boundary.

The minimum of the mean price variable is 71 and its first quartile is 154, so the first class is divided accordingly. The median of the mean price is 204 and its standard deviation is 297.9, which define the second and third classes. Finally, to identify outliers, since the third quartile is 282.6, the standard deviation is 297.9 and the maximum is 1818, the last two classes have 1000 and 1800 as their lower boundaries respectively.
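The fixed breaks derived above can be checked with base R; the sample counts below are hypothetical, and findInterval() shows which class each ZIP-level count falls into:

```r
# Fixed breaks for the Airbnb count variable, matching the map code:
# classes derived from min 1, Q1 71, median 172, sd 162.5, Q3 257, max 794
count_breaks <- c(1, 72, 172, 359, 500, Inf)

# Hypothetical ZIP-level counts for illustration
demo_counts <- c(1, 71, 150, 300, 520, 794)

# findInterval() returns the class index of each count
findInterval(demo_counts, count_breaks)  # 1 1 2 3 5 5
```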

New forms of spatial data like Airbnb data can offer up-to-date snapshots of how people are using spaces, leading to near-real-time insights. Such data generally exhibit high heteroskedasticity, but they also provide detail at a granular level, down to individual Airbnbs. On the other hand, new forms of spatial data like Airbnb data never represent all segments of the market, and less standardized collection methodologies can introduce measurement errors.

Airbnb data is dynamic and reflects real-time changes in the market, like social media or GPS data. Like social media or GPS data, Airbnb data is user-generated. This is quite different from census-type data.

Code for Maps of Count and Price of AIRBNBs per ZIP
tmap_mode("view")   # Interactive mode
map1.11 <- tm_shape(listings_bayarea_zip1) +  # listings_bayarea_zip1 is a re-arrangement of listings_bayarea_zip with count_airbnb as the first column
  tm_fill( "count_airbnb",style="fixed", title = "No. of Airbnb's per ZIP", alpha = 1, breaks=c( 1, 72, 172, 359, 500, Inf), palette = "-viridis")+ 
  tm_borders()+
  tm_text("ZIP", size = 0.3)+
  tm_view(set.view = c(center_lon, center_lat, zoom_level))

map1.12 <- tm_shape(listings_bayarea_zip2) +  # listings_bayarea_zip2 is a re-arrangement of listings_bayarea_zip with mean_price as the first column
  tm_fill( "mean_price",style="fixed", title = "Mean Price of Airbnb per ZIP", alpha = 1, breaks=c( 74, 155, 287, 501,1000, 1800, Inf), palette = "-viridis")+
  tm_borders()+
  tm_text("ZIP", size = 0.3)+
  tm_view(set.view = c(center_lon, center_lat, zoom_level))

tmap_arrange(map1.11, map1.12)

3.2 Median Household Income and No. of People aged 18-64 who have a Computer and Internet Subscription per ZIP Code Areas

We filter the Median Household Income and Internet Use variables to Bay Area ZIP codes. Then we join the ACS dataset to the Bay Area ZIP Code Areas shapefile. As an additional step, we remove all ZIP codes where the Internet Use variable is missing.

Code for Cleaning Attribute information in Shapefile
bay_area_zipcodes <- bayarea_zipcodes$ZIP   # Creating new var with Bayarea ZIP codes

bayarea_zip_vars_data <- us_zip %>%        # filtering the ACS data (two variables) to Bay Area ZIP codes
  filter(GEOID %in% bay_area_zipcodes)%>%  # using the Bay Area ZIP codes variable created above
  rename(ZIP = GEOID)                     # renaming GEOID to ZIP

bayarea_zip_vars <- bayarea_zipcodes %>% left_join(bayarea_zip_vars_data, by = "ZIP") %>% filter(!(is.na(estimate_InternetUse))) # Joining two vars from ACS with bayarea_zipcodes shapefile which is downloaded from berkeley library and also removing NA values of Internet Use variable.
head(bayarea_zip_vars)
Simple feature collection with 6 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 5936840 ymin: 2232344 xmax: 6251037 ymax: 2506800
Projected CRS: NAD83 / California zone 3 (ftUS)
# A tibble: 6 × 8
  ZIP   PO_NAME   STATE       Area__ Length__                           geometry
  <chr> <chr>     <chr>        <dbl>    <dbl>    <MULTIPOLYGON [US_survey_foot]>
1 94558 NAPA      CA    12313263537   995176. (((6102909 2377436, 6102846 23769…
2 95620 DIXON     CA     7236949521.  441860. (((6230747 2302747, 6219259 23030…
3 95476 SONOMA    CA     3001414165.  311319. (((6013411 2248863, 6013202 22488…
4 94559 NAPA      CA     1194301745.  359105. (((6045939 2248058, 6044555 22480…
5 94533 FAIRFIELD CA      991786103.  200773. (((6146297 2299595, 6146371 22987…
6 94954 PETALUMA  CA     2006544443.  267474. (((5998504 2235043, 5991182 22339…
# ℹ 2 more variables: estimate_MedianHhI <dbl>, estimate_InternetUse <dbl>

We remove all ZIP codes where Median Household Income variable is missing.

Summary of Median Household Income per Zipcode (estimate_MedianHhI): 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  40813   89583  111037  122296  153256  250001 

Summary of No.of People aged 18-64 who has Computer and Internet Subscription 
per Zipcode (estimate_InternetUse): 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    8110   16235   17830   24914   59638 
Standard Deviation of Variable(estimate_MedianHhI):     45357 
Standard Deviation of Variable(estimate_InternetUse):   12765.53 

Based on the summary statistics for both ACS variables per ZIP code, the range and standard deviation of each variable are large, so we use the Fisher-Jenks data classification method, which is well suited to highly skewed data.
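Fisher-Jenks minimizes within-class variance; the maps below obtain it through tmap's style = "fisher" (backed by the classInt package). As a rough illustration of the idea only, a one-dimensional k-means on a right-skewed variable optimizes the same variance criterion (the data here are simulated, not the ACS values):

```r
set.seed(42)
# Simulated right-skewed values standing in for a skewed ACS variable
skewed <- c(rlnorm(180, meanlog = 11.4, sdlog = 0.4),
            rlnorm(20,  meanlog = 12.3, sdlog = 0.2))

# 1-D k-means minimizes within-class variance, the same criterion
# Fisher-Jenks optimizes exactly; sorted centers order the five classes
km <- kmeans(skewed, centers = 5, nstart = 25)
sort(km$centers[, 1])

# The exact Fisher-Jenks breaks would come from classInt:
# classInt::classIntervals(skewed, n = 5, style = "fisher")$brks
```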

The Bay Area ZIP Code Areas form a mixed-use type of neighborhood. Most neighborhoods have medium household incomes and high Internet subscription counts. Socio-economic indicators like Median Household Income and Internet Use help delineate this neighborhood typology. San Francisco and Oakland have more affluent neighborhoods, with high Internet usage and medium household income, whereas San Mateo has high-household-income neighborhoods with medium Internet usage.

Based on this classification, a reasonable hypothesis is that Airbnbs would cluster in San Francisco and Oakland, due to high Internet usage and medium household income (middle- to high-class residential areas). Airbnb listings would also cluster in areas with low commercial electricity rates and a high number of vehicles.

Code for Maps of Median Household Income and Internet Use per ZIP
tmap_mode("plot")
map2.1 <- tm_basemap() +
  tm_shape(bayarea_zip_vars1) +    # shapefile for map 2
  tm_fill(col = "estimate_MedianHhI", palette = "YlGn", n = 5, style = "fisher", title = "Median Household Income (in $ USD)") + # estimate_MedianHhI
  tm_compass(position = c(0.85, 0.85)) +  # Adding Compass
  tm_scale_bar(position = c(0.625, 0.02)) + # Adding Scalebar
  tm_borders(lwd = 0.3)+      # Borders for zcta's polygons
  tm_layout(
    legend.title.size = 0.715,
    legend.text.size = 0.6, 
    inner.margins = c(0.22, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "Median Household Income per ZIP Code",  # Add the map title within tm_layout
    main.title.size = 0.7,
    main.title.position = "center"
  )


map2.2 <- tm_basemap() +
  tm_shape(bayarea_zip_vars) +
  tm_fill(col = "estimate_InternetUse", palette = "YlGn", n = 5, style = "fisher", title = "Broadband Internet Connection (People)") +
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_borders(lwd = 0.3)+
  tm_layout(
    legend.title.size = 0.715,
    legend.text.size = 0.6, 
    inner.margins = c(0.185, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "No. of People aged (18-64) using Broadband Internet per ZIP Code",  # Add the map title within tm_layout
    main.title.size = 0.7,
    main.title.position = "center"
  )
tmap_arrange(map2.1, map2.2)

3.3 MAP 3: AIRBNB Listings and Internet Use Variable (Combining Datasets)

We change the order of attributes in the Bay Area ZIP Codes shapefile as follows:

We calculate the natural logarithm of the price of the Airbnb listings in the overlay shapefile and check the CRS of both shapefiles:

Code for Calculating Log prices of AIRBNBs
listings_bayarea_sf_transform_copy <- listings_bayarea_sf_transform
listings_bayarea_sf_transform_copy$nl_price <- log(listings_bayarea_sf_transform_copy$price) # calculating natural logarithm of price ( log to base e)
st_crs(bayarea_zip_vars_map3) == st_crs(listings_bayarea_sf_transform_copy) # checking CRS of both shapefiles

[1] TRUE

We change the order of attributes in the overlay listings shapefile as follows:

Based on the summary statistics for Internet usage (people) per ZIP code and the natural logarithm of Airbnb price, the range and standard deviation are large, so we use the Fisher-Jenks data classification method, which is well suited to highly skewed data.

Airbnb listings are mostly clustered in the San Francisco, Daly City, South San Francisco, Burlingame, Foster City, Redwood City and Oakland areas (geographical observations from the map below). High values of the natural logarithm of price are found in San Francisco, along the west-coast seaside, and in Redwood City (San Carlos).

In San Francisco, ZIP code areas with high Internet usage apparently also have many Airbnb listings, and ZIP code areas with around 5,500 people using the Internet have high prices for Airbnb listings.

Code for Maps of Internet Use per ZIP in San Francisco
tmap_mode("view") # interactive mode activating

map3 <- tm_shape(bayarea_zip_vars_map3) +
  tm_fill(col = "estimate_InternetUse", alpha = 1, palette = "YlGn", n = 5, style = "fisher",title = "Broadband Internet Connection(People)") +
  tm_borders() +
  tm_text("ZIP", size = 0.3)+
  tm_shape(listings_bayarea_sf_transform_copy) +
  tm_dots(col = "nl_price", palette = "Reds", n = 3, style = "fisher", title = "Natural Logarithm of Airbnb's Price") +
  tm_view(set.view = c(center_lon, center_lat, zoom_level))# Set legend position for interactive view

map3

3.4 MAP 4: Spatial Auto-Correlation

Calculating the list of neighbors (neighborhoods) and the queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Median Household Income:

Code for calculating the spatial weights matrix for ZIP codes which have Median Household Income
nb_q1 <- poly2nb(bayarea_zip_vars1, queen = TRUE) # shapefile used, have no NA values of Median Household Income variable

w_queen1 <- nb2listw(nb_q1, style = "B", zero.policy=TRUE) # queen-contiguity based Weights for list of neighbors for all observations 
isolates1 <- which(w_queen1$neighbours == "0")
bayarea_zip_vars1_NoZero <- bayarea_zip_vars1[-c(isolates1),] # shapefile where there are no observation with zero neighbours based on queen-contiguity

Calculating the list of neighbors (neighborhoods) and the queen-contiguity spatial weights matrix for the Bay Area ZIP code areas that have Internet Usage:

Code for calculating the spatial weights matrix for ZIP codes which have Internet Use
nb_q2 <- poly2nb(bayarea_zip_vars, queen = TRUE) # shapefile used, have no NA values of Broadband Internet Use

w_queen2 <- nb2listw(nb_q2, style = "B", zero.policy=TRUE)
isolates2 <- which(w_queen2$neighbours == "0")

bayarea_zip_vars_NoZero <- bayarea_zip_vars[-c(isolates2),] # shapefile where there are no observation with zero neighbours based on queen-contiguity

3.4.1 Global Spatial Auto-Correlation:

Calculating the list of neighbors (neighborhoods) and the queen-contiguity row-standardized spatial weights matrix for the Bay Area ZIP code areas that have Median Household Income and/or Internet Usage (people):

Code for calculating the standardized spatial weights matrix for ZIP codes which have Median Household Income or Internet Use
nb_q1 <- poly2nb(bayarea_zip_vars1_NoZero, queen = TRUE) # Constructing neighbours list from filtering observations which have no zero neighbours and no NA values of Median Household Income
w_queen_std1 <- nb2listw(nb_q1, style = "W") # creating spatial weights matrix using queen contiguity and row-standardardised weights

nb_q2 <- poly2nb(bayarea_zip_vars_NoZero, queen = TRUE) # Constructing neighbours list from filtering out observations which have zero neighbours and no NA values of Broadband Internet Use
w_queen_std2 <- nb2listw(nb_q2, style = "W") # creating spatial weights matrix using queen contiguity and row-standardardised weights

Calculating Spatial Lag for Median Household Income and Internet Usage(People):

Code for calculating spatial lag for ZIP codes
bayarea_zip_vars1_NoZero$sl_MedianHhI <- lag.listw(w_queen_std1, bayarea_zip_vars1_NoZero$estimate_MedianHhI) # calculating the spatial lag of the Median Household Income variable in the Bay Area

bayarea_zip_vars_NoZero$sl_InternetUse <-  lag.listw(w_queen_std2, bayarea_zip_vars_NoZero$estimate_InternetUse) # calculating spatial lag for Internet Use Variable in Bay Area 

Moran’s I for Median Household Income in Bay Area ZIP Code Areas:

Code for calculating Moran’s I statistic for Median Household Income
moran.mc(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="greater") # Moran's I statistic for Median Household Income Variable in Bay Area

    Monte-Carlo simulation of Moran I

data:  bayarea_zip_vars1_NoZero$estimate_MedianHhI 
weights: w_queen_std1  
number of simulations + 1: 1001 

statistic = 0.39557, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater

Moran’s I for Median Household Income in the Bay Area is 0.396 with a p-value of 0.000999, so the observed spatial pattern is statistically significant. The p-value is the probability of obtaining a Moran’s I value greater than the observed one under a random spatial distribution. If it is less than 0.05, we reject the null hypothesis that the observed spatial pattern is due to random chance. Moran’s I is analogous to a chi-square statistic in that it compares an observed pattern with an expected one. The spatial distribution of Median Household Income is more spatially concentrated than a random spatial distribution.
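The reported p-value follows directly from the permutation logic of moran.mc: with nsim = 1000 and the observed statistic ranked above every simulated value, p = (exceedances + 1) / (nsim + 1). A quick base-R check of that arithmetic:

```r
nsim <- 1000
exceedances <- 0  # no simulated Moran's I reached the observed 0.396
p_value <- (exceedances + 1) / (nsim + 1)
round(p_value, 6)  # 0.000999, matching the moran.mc output above
```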

Moran’s I for Internet Usage (people) in Bay Area ZIP Code Areas:

Code for calculating Moran’s I statistic for Internet Use
moran.mc(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="greater") # Moran's I statistic for the Internet Use variable in the Bay Area

    Monte-Carlo simulation of Moran I

data:  bayarea_zip_vars_NoZero$estimate_InternetUse 
weights: w_queen_std2  
number of simulations + 1: 1001 

statistic = 0.23313, observed rank = 1001, p-value = 0.000999
alternative hypothesis: greater

Moran I’s for Internet Usage (People) in Bay Area is 0.23313 and p-value is 0.000999. So, Observed Spatial Pattern is statistically Significant. Spatial distribution of Internet Usage (People) is mild spatially concentrated than random spatial distribution.

3.4.2 Local Spatial Auto-Correlation:

Calculating LISAs for the Bay Area ZIP Code Areas for both Median Household Income and Internet Usage (people):

Code for calculating LISA statistics
lisa_perm1 <- localmoran_perm(bayarea_zip_vars1_NoZero$estimate_MedianHhI, w_queen_std1, nsim=1000, alternative="two.sided") # Lisa Clustering for Median Household Income in Bay Area

lisa_perm2 <- localmoran_perm(bayarea_zip_vars_NoZero$estimate_InternetUse, w_queen_std2, nsim=1000, alternative="two.sided") # Lisa Clustering for Internet Use Variable in Bay Area

Calculating the quadrant of each observation’s local Moran’s I, with the significance cutoff at 0.2. As there are no significant quadrants at the 0.1 level, the cutoff was chosen as the smallest value that yields at least some levels for at least one variable:

Code for calculating LISA quadrants
quadrants1 <- hotspot(lisa_perm1, Prname="Pr(z != E(Ii)) Sim", cutoff=0.2) # Creating quadrants from significance values of local Moran's I statistic calculated from Median Household Income

quadrants2 <- hotspot(lisa_perm2, Prname="Pr(z != E(Ii)) Sim", cutoff=0.2) # Creating quadrants from significance values of local Moran's I statistic calculated from Internet Use Variable
Levels of Quadrant 1 :
Low-Low, High-Low, Low-High, High-High


Levels of Quadrant 2 :
Low-Low, High-Low, Low-High, High-High

Adding Quadrants back to Bay Area ZIP Code Areas Shapefile:

Code for adding new categorical variable back to shapefile
bayarea_zip_vars1_NoZero$quadrant1 <- as.character(quadrants1)  %>% replace_na("Not significant") # Adding created quadrants to shapefile(no NA values for Median Household Income) where no observation with zero neighbours based on queen-contiguity


bayarea_zip_vars_NoZero$quadrant2 <- as.character(quadrants2)  %>% replace_na("Not significant") # Adding created quadrants to shapefile(no NA values for Internet Use Variable) where no observation with zero neighbours based on queen-contiguity

To view complete chunks, see Source code.

As there were no quadrants with the cutoff at 0.1, we set the cutoff at 0.2 to obtain a minimum number of quadrants for at least one variable. These LISA clusters are therefore likely to be arranged by random chance. In other words, the observed clustering at the whole-Bay-Area level has spatial structure, whereas the individual LISA clusters may be due to random chance: the observed spatial pattern in the LISA clusters is not statistically significant.

At the same time, it would be too strong to conclude that the observed spatial pattern in the LISA clusters is purely random chance. This requires further and more advanced analysis.
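One natural follow-up, sketched here with base R on synthetic p-values (not the actual LISA output), is to correct the local p-values for multiple testing with p.adjust(), since hundreds of local tests run at once and some ‘significant’ ZIP codes are expected by chance alone:

```r
set.seed(7)
# Synthetic local p-values standing in for the LISA permutation p-values:
# 95 unremarkable locations and 5 genuinely low ones
p_raw <- c(runif(95, 0.05, 1), runif(5, 0.0001, 0.01))

# Benjamini-Hochberg false discovery rate correction (base R)
p_fdr <- p.adjust(p_raw, method = "fdr")

sum(p_raw < 0.05)  # locations significant before correction: 5
sum(p_fdr < 0.05)  # usually fewer remain after correction
```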

Code for creating LISA maps
tmap_mode("plot")
lisa_map_cluster1 <- borders1 +
  hh_map1 + ll_map1 + hl_map1  +lh_map1  + ns_map1 +   # combining all quadrant maps
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_add_legend(type = "fill", col = c("royalblue2", "red2", "darkgreen", "gold", "lightgrey"), 
                labels = c("High-High", "Low-Low", "High-Low", "Low-High", "Not significant"), title = "LISA cluster(Median Household Income)") +
  tm_layout(
    legend.title.size = 0.715,
    legend.text.size = 0.6, 
    inner.margins = c(0.22, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "LISA Clusters for Median Household Income Variable", 
    main.title.size = 0.7,
    main.title.position = "center"
  )


lisa_map_cluster2 <- borders2 +
  hh_map2 + ll_map2 + hl_map2   + ns_map2 +  #  Combining All quadrant maps
  tm_compass(position = c(0.85, 0.85)) +
  tm_scale_bar(position = c(0.625, 0.02)) +
  tm_add_legend(type = "fill", col = c("royalblue2", "red2", "darkgreen", "lightgrey"), 
                labels = c("High-High", "Low-Low", "High-Low", "Not significant"), title = "LISA cluster(Broadband Internet Use)") +
  tm_layout(
    legend.title.size = 0.715,
    legend.text.size = 0.6, 
    inner.margins = c(0.185, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "LISA Clusters for Broadband Internet Use Variable",  
    main.title.size = 0.7,
    main.title.position = "center"
  )
tmap_arrange(lisa_map_cluster1, lisa_map_cluster2)

3.5 Geo-Demographic Classification in Bay Area

We create an overlay shapefile by spatially joining the Bay Area ZIP Code Areas and the Airbnb listings. Then we group by ZIP to calculate all variables, and join the average commercial electricity rate and the number of light-duty vehicles per ZIP code.

Code for calculating dataframe required for Geodemographic Classification
overlay_listings <- st_join(bayarea_zip_vars, listings_bayarea_sf_transform,  join = st_intersects) # spatial join to create Overlay Shapefiles

overlay_listings_zip <- overlay_listings  %>% 
  group_by(ZIP) %>% 
  summarise(
    count_airbnb =    sum(!is.na(price)),  # count non-NA price rows for count
    mean_price =      mean(price, na.rm = TRUE), # Mean price 
    mean_review  =    mean(number_of_reviews, na.rm = TRUE), # No. of Reviews
    mean_min_nights = mean(minimum_nights, na.rm = TRUE),   # No. of Minimum Nights stay
    Median_HhI =      mean(estimate_MedianHhI, na.rm = TRUE), # Median Household income
    Internet_Use =    mean(estimate_InternetUse, na.rm = TRUE) # Internet Usage
  )%>%
  filter(count_airbnb != 0) 

overlay_listings_zip <- overlay_listings_zip %>% left_join(vc_califorina, by = "ZIP") # Join overlay listings with Light-Duty Vehicles count
overlay_listings_zip <- overlay_listings_zip %>% left_join(electric_u, by = "ZIP") # joining the overlay listings with the commercial electricity rate

overlay_listings_zip <- na.omit(overlay_listings_zip) # removing NA values

We create an 8-tuple vector of variable names for the multivariate classification:

Code for sub-setting 8-tuple vector
airbnb_ra <- c('count_airbnb', 'mean_price', 'mean_review', 'mean_min_nights', 'Median_HhI', 'Internet_Use', 'No_of_Light_Duty_Vehicles', 'Combined_Avg_Comm_Rate') # geodemographic arguments - (8 - tuple) , multivariate classification

Subsetting Overlay Listings Shapefile without geometry attribute:

Code for sub-setting attribute information from shapefile
overlay_listings_zip_NoGeo <- st_drop_geometry(overlay_listings_zip[, airbnb_ra,  drop=FALSE]) # we are dropping geometry col in shp

overlay_listings_zip_NoGeo <- overlay_listings_zip_NoGeo[complete.cases(overlay_listings_zip_NoGeo), ] # removing NA Values

K-Means Clustering:

set.seed(12345)  # setting seed

k6cls <- kmeans(overlay_listings_zip_NoGeo, centers = 6, iter.max = 1000) # k-means clustering with 6 clusters (each observation is an 8-tuple)
overlay_listings_zip$k6cls <- k6cls$cluster   # cluster assignment joined back to the overlay listings
k6cls$centers # centers of the 6 clusters: six 8-tuple vectors, one defining the center of each cluster
  count_airbnb mean_price mean_review mean_min_nights Median_HhI Internet_Use
1     279.0000   179.9462    53.98208        6.132616  106352.00     42383.00
2     197.1875   278.3737    54.32059       15.048195  171773.62     17882.00
3      48.0000   327.6437    35.09278       14.406388  242917.50      6848.00
4     179.5385   438.6613    30.89614       15.389951   57687.23     15399.62
5     202.0000   306.7908    59.36053       13.594888  128579.62     20916.00
6     256.8235   174.3755    51.10254       14.275260   96413.88     26327.35
  No_of_Light_Duty_Vehicles Combined_Avg_Comm_Rate
1                 126455.00             0.09763171
2                  19991.88             0.09186155
3                  10437.00             0.13166995
4                  14125.69             0.10183107
5                  21415.62             0.09796353
6                  25233.29             0.11473432
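
One caveat worth noting: the eight variables span very different scales (incomes around 10^5, electricity rates below 1), so Euclidean k-means distances on the raw matrix are dominated by the largest-magnitude columns. A common remedy, sketched below on synthetic stand-in data (the matrix `X` is random, not the real 8-tuple data), is to z-score with `scale()` first and use an elbow plot of total within-cluster sum of squares to justify the choice of k:

```r
set.seed(12345)
X <- scale(matrix(rnorm(200 * 8), ncol = 8))  # stand-in for the standardised 8-tuple matrix

# total within-cluster sum of squares for k = 1..10; nstart guards against poor local optima
wss <- sapply(1:10, function(k)
  kmeans(X, centers = k, iter.max = 1000, nstart = 10)$tot.withinss)

plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")  # look for the "elbow"
```

On the real data, the same loop over `scale(overlay_listings_zip_NoGeo)` would indicate whether k = 6 sits near the elbow.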

By observing the centers of the different clusters, the main neighborhood types can be identified: tourist-focused areas, residential areas, economically diverse areas, and vehicle-dependent areas. Tourist-focused areas show a high listing count, high internet usage, and low median household income. Residential areas show high median household income, high internet usage, and a high number of light-duty vehicles. Economically diverse areas show medium median household income (Median_HhI) and internet usage (Internet_Use). Vehicle-dependent areas are identified by a high vehicle count.
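
Once each center has been interpreted this way, the numeric cluster IDs can be mapped to readable labels with a named vector, which makes map legends self-explanatory. The label-to-cluster assignment below is purely illustrative, not derived from the centers above:

```r
# Illustrative labels only; match each ID to the interpretation of its center
cluster_labels <- c(
  "1" = "Vehicle-dependent",
  "2" = "Residential",
  "3" = "High-income residential",
  "4" = "Tourist-focused",
  "5" = "Economically diverse",
  "6" = "Tourist-focused"
)

k6_toy   <- c(1, 4, 6, 2)                       # toy cluster-assignment vector
k6_named <- unname(cluster_labels[as.character(k6_toy)])
k6_named  # "Vehicle-dependent" "Tourist-focused" "Tourist-focused" "Residential"
```

In the real pipeline the same lookup would be applied to `overlay_listings_zip$k6cls`, and the resulting character column plotted instead of the raw integer IDs.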

Characteristics helping to delineate the typology are the Airbnb metrics (‘count_airbnb’, ‘mean_price’, ‘mean_review’, ‘mean_min_nights’), median household income, internet usage (people), number of light-duty vehicles, and average commercial electricity rate.

Using this classification, we can target the areas most in need. Economically weak neighborhoods, identified by low median household income and low internet usage, can be prioritized for development programs and infrastructure improvements. To manage tourist activity in neighborhoods with a high Airbnb count, urban planning committees can focus on minimizing the impact of tourism on local communities, for example by protecting affordable housing. For transportation and urban planning, neighborhoods with a high vehicle count might benefit from improved public transportation and development of local amenities that reduce travel. Community services such as digital hubs might improve overall livelihoods in neighborhoods with low internet usage and varied median household income.

Code for creating Geo-demographic clustering map
tmap_mode("plot")

map_cluster1 <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "k6cls", palette = viridis(6), style = "cat", title = "Type of Cluster") + # cluster IDs are categorical, so use a categorical style with 6 colours
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "white", lwd = 0.2)+
  tm_layout(
    legend.title.size = 0.7,
    legend.text.size = 0.6, 
    inner.margins = c(0.2, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "Type of Clusters in Bay Area",  # Add the map title within tm_layout
    main.title.size = 0.7,
    main.title.position = "center"
  )

map_veh <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "No_of_Light_Duty_Vehicles", palette = "YlGn", n = 5, style = "fisher", title = "Vehicles per ZIP") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2)+
  tm_layout(
    legend.title.size = 0.7,
    legend.text.size = 0.6, 
    inner.margins = c(0.185, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "No. of Light-Duty Vehicles in Bay Area",  # Add the map title within tm_layout
    main.title.size = 0.7,
    main.title.position = "center"
  )


map_electric <- tm_basemap() +
  tm_shape(overlay_listings_zip) +
  tm_fill(col = "Combined_Avg_Comm_Rate", palette = "YlGn", n = 5, style = "fisher", title = "Commercial Electricity Rate") +
  tm_compass(position = c(0.85, 0.88)) +
  tm_scale_bar(position = c(0.58, 0.02)) +
  tm_borders(col = "black", lwd = 0.2)+
  tm_layout(
    legend.title.size = 0.7,
    legend.text.size = 0.6, 
    inner.margins = c(0.185, 0.13, 0.02, 0.08), 
    legend.position = c(0.05, 0.02), 
    legend.width = 0.5, 
    bg.color = "#eaf5fa",
    main.title = "Commercial Electricity Rate in Bay Area",  # Add the map title within tm_layout
    main.title.size = 0.7,
    main.title.position = "center"
  )
tmap_arrange(map_cluster1, map_veh, map_electric, nrow = 1, ncol = 3)

4 Conclusion

Geodemographic classification supports data-informed decisions for policymaking and societal improvement. In reality, the picture is more nuanced: we also need an understanding of neighborhood dynamics, real-estate markets, and other socio-economic factors. Still, by targeting areas based on neighborhood characteristics, policymakers can effectively address local issues and enhance overall community well-being.

Although multivariate clustering was performed, we neglected the temporal dimension, which introduces temporal inconsistencies. New forms of spatial data such as Airbnb data can also introduce biases. Beyond the insights from this geodemographic classification, well-informed policy decisions require more sophisticated approaches and real-time verification procedures.

Note:

I apologize for any references I may have inadvertently forgotten to mention in this case study. All my work is built upon the invaluable contributions of the giants in the data science field, on whose shoulders I stand. I am deeply grateful for their insights and advancements, which have made this study possible.

5 References

  1. GIS Geography (2023) Choropleth Maps - A Guide to Data Classification. [online] Available at: https://gisgeography.com/choropleth-maps-data-classification/ (Accessed: 08 January 2024).

  2. Ross, S. (2009) Introduction to Probability and Statistics for Engineers and Scientists. 4th edn. London: Elsevier Academic Press.

  3. United States Geological Survey (2023) What is the State Plane Coordinate System? Can GPS provide coordinates in these values?. Available at: https://www.usgs.gov/faqs/what-state-plane-coordinate-system-can-gps-provide-coordinates-these-values#:~:text=The%20State%20Plane%20Coordinate%20System%20(SPCS)%2C%20which%20is%20only,the%20state%27s%20size%20and%20shape. (Accessed: 08 January 2024).

  4. United States Census Bureau (2023) ZIP Code Tabulation Areas (ZCTAs). Available at: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html (Accessed: 08 January 2024).

  5. ArcGIS Hub (2020) USA State Plane Zones NAD83. Available at: https://hub.arcgis.com/datasets/esri::usa-state-plane-zones-nad83/about (Accessed: 08 January 2024).

  6. ‘zip-Determining which US zipcodes map to more than one state or more than one city?-Geographic Information Systems’ (2021) Geographic Information Systems Stack Exchange. Available at: https://gis.stackexchange.com/questions/53918/determining-which-us-zipcodes-map-to-more-than-one-state-or-more-than-one-city (Accessed: 08 January 2024).

  7. UC Berkeley GeoData Repository (2004) Bay Area ZIP Code Areas [Dataset]. Available at: https://geodata.lib.berkeley.edu/catalog/ark28722-s7888q (Accessed: 08 January 2024).

  8. California Open Data (2023) Vehicle Fuel Type Count by Zip Code [Dataset]. Available at: https://data.ca.gov/dataset/vehicle-fuel-type-count-by-zip-code (Accessed: 08 January 2024).

  9. National Renewable Energy Laboratory (2022) U.S. Electric Utility Companies and Rates: Look-up by Zipcode (2020) [Dataset]. Available at: https://catalog.data.gov/dataset/u-s-electric-utility-companies-and-rates-look-up-by-zipcode-2020 (Accessed: 08 January 2024).